2024-03-01
“First step towards a professional career in baseball analytics”
Where do you find baseball data?
What are useful tools for exploring this data?
Examples of baseball research?
3rd edition (with Max Marchi and Ben Baumer) available in paperback this summer
Online version available for free at http://tinyurl.com/abdwr3e
Chapters on sources of baseball data, sabermetrics, and coding using R
Using data to measure performance and facilitate decision making in sports
Baseball was one of the first sports to use data to address questions (sabermetrics)
Sports analytics is now applied in many sports
Bill James defined sabermetrics in 1980 as “the search for objective knowledge about baseball”
Moneyball (the book by Michael Lewis and the movie) describes the use of sabermetrics by the Oakland Athletics in the 2002 season
All 30 MLB teams currently have analytics departments
Lahman database gives season to season data for all teams and players in baseball history
Retrosheet has game-by-game and play-by-play data for many seasons
Statcast is the newest source of data, covering pitches, balls put into play, and player locations
Practically all of this data is publicly available
How to intelligently draft, sign and trade players?
How to measure performance?
How to use players in a game? (Defensive positioning, relief pitching)
Journals such as the Journal of Quantitative Analysis of Sports and the Journal of Sports Analytics
Conferences: Saberseminar, Carnegie Mellon Sports Analytics Conference, and the New England Symposium on Statistics in Sports
Blogs (Baseball Prospectus, FanGraphs, etc)
YES!
Practically all professional sports teams have analytics groups
I know people working for MLB, NFL, NHL teams
Companies such as Zelus Analytics “provide sports intelligence as a service to professional teams”
“The good receiver often makes many doubtful strikes pitches by catching the ball properly. This is not done by jerking or pulling the ball over the plate. Instead it is done by bringing all close pitches towards the belt buckle if they are just inside or outside of home plate … The entire action must be smooth if the umpire is to be deceived.”
From Power Ball: Anatomy of a Modern Baseball Game
Start at a 0-0 count (0 balls and 0 strikes)
Every pitch adds a strike or a ball
Possible counts: 0-0, 1-0, 0-1, 2-0, 1-1, 0-2, 3-0, 2-1, 1-2, 3-1, 2-2, 3-2
Three types of counts: Pitcher (like 1-2), Batter (like 3-1), and Neutral (like 1-1)
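The list of possible counts can be generated rather than typed out. The sketch below assumes the simplification stated above, that every pitch adds exactly one ball or one strike (so foul balls with two strikes and balls put into play are ignored); the helper name `next_counts` is made up for illustration.

```python
from itertools import product

# A plate appearance ends at 4 balls or 3 strikes, so a live count
# has 0-3 balls and 0-2 strikes.
counts = [f"{balls}-{strikes}" for balls, strikes in product(range(4), range(3))]


def next_counts(count):
    """Counts reachable after one more pitch (hypothetical helper).

    Assumes each pitch adds exactly one ball or one strike.
    """
    balls, strikes = map(int, count.split("-"))
    reachable = []
    if balls < 3:
        reachable.append(f"{balls + 1}-{strikes}")
    if strikes < 2:
        reachable.append(f"{balls}-{strikes + 1}")
    return reachable


print(len(counts))
print(next_counts("0-0"))
print(next_counts("3-2"))
```

A 3-2 count has no successors because the next ball or strike ends the plate appearance.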
Outcome of every pitch in a plate appearance gives an advantage to the pitcher or the batter
How do you measure this advantage?
State of an inning defined by the number of outs and the runners on base
There are 3 \(\times\) 8 = 24 possible inning states.
For each state, define “Runs Expectancy” to be the expected number of runs scored in the remainder of the inning.
Compute using data for a particular season
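The count of 24 inning states comes from crossing three out totals with eight base-runner configurations. A minimal sketch of that enumeration, where the tuple encoding of the bases is an assumption chosen for illustration:

```python
from itertools import product

outs_options = [0, 1, 2]

# Each of first, second, and third base is either empty (0) or occupied (1),
# which gives 2 * 2 * 2 = 8 runner configurations.
runner_options = list(product([0, 1], repeat=3))

# An inning state pairs a number of outs with a runner configuration.
states = [(outs, runners) for outs in outs_options for runners in runner_options]

print(len(runner_options))
print(len(states))
```

A Runs Expectancy table would then hold one average value per entry of `states`, computed from a season of play-by-play data.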
Measure by the Runs Expectancy Matrix
Look at the Runs in the “before” and “after” states
Runs Value = \(RE24_{after} - RE24_{before} + Runs \, Scored\)
Value of Starting State: 0.50
Value of Ending State: 0.25
Two runs scored on play
Value of Home Run is \[Value = 0.25 - 0.50 + 2 = 1.75 \, \, Runs\]
Value of Starting State: 0.50
Value of Ending State: 0.66
No runs scored on play
Value of Stolen Base is \[ Value = 0.66 - 0.50 + 0 = 0.16 \, \, Runs\]
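The two worked examples above follow the same formula, so they can be checked with one small function. This is a sketch; the function and argument names are illustrative and not taken from any library.

```python
def runs_value(re_before, re_after, runs_scored):
    """Runs value of a play: change in Runs Expectancy plus runs scored."""
    return re_after - re_before + runs_scored


# Home run example: state value drops from 0.50 to 0.25, two runs score.
home_run = runs_value(re_before=0.50, re_after=0.25, runs_scored=2)

# Stolen base example: state value rises from 0.50 to 0.66, no runs score.
stolen_base = runs_value(re_before=0.50, re_after=0.66, runs_scored=0)

print(round(home_run, 2), round(stolen_base, 2))
```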
Runs value of 1-2 count?
Look at all plate appearances that pass through a 1-2 count
Average the runs in the remainder of the inning for all these plate appearances
Repeat this process for all possible counts, and graph
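The averaging procedure above can be sketched in a few lines. The records below are made-up placeholder values used only to show the mechanics, not real data, and the record layout (a list of counts seen plus the runs scored in the rest of the inning) is an assumption.

```python
from collections import defaultdict

# Made-up records for illustration only:
# (counts the plate appearance passed through, runs in remainder of inning)
plate_appearances = [
    (["0-0", "0-1", "1-1", "1-2"], 0),
    (["0-0", "1-0", "1-1", "1-2"], 1),
    (["0-0", "1-0", "2-0"], 2),
    (["0-0", "0-1", "0-2", "1-2"], 0),
]


def runs_by_count(records):
    """Average remainder-of-inning runs over plate appearances passing through each count."""
    totals = defaultdict(float)
    appearances = defaultdict(int)
    for counts_seen, runs in records:
        # Use a set so a plate appearance is counted once per count it visited.
        for count in set(counts_seen):
            totals[count] += runs
            appearances[count] += 1
    return {count: totals[count] / appearances[count] for count in totals}


values = runs_by_count(plate_appearances)
for count in sorted(values):
    print(count, round(values[count], 3))
```

With a real season of pitch-level data, the resulting dictionary is what would be graphed across all twelve counts.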
Bill James found an empirical relationship between R / RA and W / L
Pythagorean formula\[ \frac{W}{L} = \left(\frac{R}{RA}\right)^k \]
A contribution of 10 more runs is equivalent to contributing one win for the team
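The ten-runs-per-win rule of thumb can be checked against the Pythagorean formula. Since W / L = (R / RA)^k, the winning fraction is W / (W + L) = R^k / (R^k + RA^k). The sketch below assumes an exponent of k = 2, a 162-game season, and a hypothetical team that scores and allows 700 runs; none of these values come from the text.

```python
def pythagorean_win_pct(runs, runs_allowed, k=2.0):
    """Winning fraction implied by W / L = (R / RA) ** k."""
    ratio = (runs / runs_allowed) ** k  # this is W / L
    return ratio / (1 + ratio)          # convert W / L to W / (W + L)


def expected_wins(runs, runs_allowed, games=162, k=2.0):
    """Expected wins over a season of `games` games (162 is an assumption)."""
    return games * pythagorean_win_pct(runs, runs_allowed, k)


# Hypothetical team: 700 runs scored and 700 allowed, then 10 more runs scored.
baseline = expected_wins(700, 700)
improved = expected_wins(710, 700)

print(round(baseline, 1))
print(round(improved - baseline, 2))
```

Under these assumptions the extra ten runs add roughly one expected win, in line with the rule of thumb.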
Each additional strike contributes runs to the defensive team
Each contribution is small, but the cumulative effect of many added strikes is large
Convert the runs contributed to wins
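The conversion chain from strikes to runs to wins is simple arithmetic. In this sketch, the runs-per-win figure of 10 is the rule of thumb stated earlier, while `runs_per_strike` and the number of added strikes are placeholder inputs chosen for illustration, not estimates from the text.

```python
RUNS_PER_WIN = 10  # rule of thumb: about 10 runs per win


def framing_value(strikes_added, runs_per_strike, runs_per_win=RUNS_PER_WIN):
    """Convert added strikes to runs saved and then to wins.

    `runs_per_strike` is an input the analyst must estimate, for example
    from the runs values of counts; it is not fixed by this function.
    """
    runs_saved = strikes_added * runs_per_strike
    wins = runs_saved / runs_per_win
    return runs_saved, wins


# Placeholder inputs for illustration only.
runs_saved, wins = framing_value(strikes_added=100, runs_per_strike=0.1)
print(round(runs_saved, 2), round(wins, 2))
```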
Pitches are thrown towards a “strike zone”
Pitches where the batter doesn’t swing are called “strikes” or “balls” by the umpire
Pitches landing inside zone should be called strikes
The umpire
The batter
The pitcher
The catcher
Other influences?
Data: all called pitches in 2016 season
Response: \(y\) (1 or 0) (Strike or Ball)
Input: (platex, platez) - location of pitch
Let \(p = P(y = 1) = Prob(Strike)\)
Fit a generalized additive model\[ \log \left(\frac{p}{1-p}\right) = s(platex, platez) \] where \(s()\) is a smooth function of the location variables
Actual strike zone is defined where \[p = P(Strike) = 0.5\]
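Inverting the logit shows what this boundary means in terms of the fitted surface: the actual strike zone is the contour where the smooth function equals zero.

\[ p = \frac{1}{1 + e^{-s(platex, platez)}}, \qquad p = 0.5 \iff s(platex, platez) = 0 \]

Locations with \(s > 0\) are more likely to be called strikes than balls, and locations with \(s < 0\) are more likely to be called balls.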
Location of the actual strike zone depends on the count
Look at actual zone at a 0-0 (Neutral) count
Compare with the actual zone on a 0-2 (Pitcher's) count
Catcher can influence the called pitch
Subtle way the ball is caught
How do you measure it?
How big an effect is it?
Outcome - called pitch (strike or ball)
Inputs:
Generalized additive model \[ \log \left(\frac{p}{1-p}\right) = s(platex, platez) + p_{j(i)} + b_{k(i)} + u_{l(i)} + ca_{m(i)} \]
Each set of random effects assigned normal prior with unknown standard deviation
Catcher framing estimates are {\(ca_m\)}
Convert these to strikes added and runs saved
Data science coursework (learning R, exploring data, modeling)
Get connected with a college sports team (baseball, basketball, soccer, volleyball)
Work on small projects and publicize your work (on a blog)
People working in the field come from a wide variety of academic backgrounds, but data science is a great background
Teams want people who are able to pose good questions and follow an analysis from beginning to end
Communication skills are important
Get started through an internship for a sports team (Saberseminar)